Abstract

Solomonoff's uncomputable universal prediction scheme $\xi$ allows one to predict the next
symbol $x_k$ of a sequence $x_1...x_{k-1}$ for any Turing computable, but otherwise unknown,
probabilistic environment $\mu$. This scheme will be generalized to arbitrary
environmental classes, which, among others, allows the construction of computable
universal prediction schemes $\xi$. Convergence of $\xi$ to $\mu$ in a conditional mean squared
sense and with $\mu$ probability 1 is proven. It is shown that the average number of
prediction errors made by the universal $\xi$ scheme rapidly converges to those made by
the best possible informed $\mu$ scheme. The schemes, theorems and proofs are given
for general finite alphabet, which results in additional complications as compared to
the binary case. Several extensions of the presented theory and results are outlined.
They include general loss functions and bounds, games of chance, infinite alphabet,
partial and delayed prediction, classification, and more active systems.



^1 This work was supported by SNF grant 2000-61847.00 to Jürgen Schmidhuber.



1 Introduction 



The Bayesian framework is ideally suited for studying induction problems. The probability
of observing $x_k$ at time $k$, given past observations $x_1...x_{k-1}$, can be computed with Bayes'
rule if the generating probability distribution $\mu$, from which sequences $x_1x_2x_3...$ are drawn,
is known. The problem, however, is that in many cases one does not even have a reasonable
estimate of the true generating distribution. What is the true probability of weather
sequences or stock charts? In order to overcome this problem we define a universal
distribution $\xi$ as a weighted sum of distributions $\mu_i \in M$, where $M$ is any finite or countable
set of distributions including $\mu$. This is a generalization of Solomonoff induction, in which
$M$ is the set of all enumerable semi-measures [Sol64, Sol78]. We show that using the
universal $\xi$ as a prior is nearly as good as using the unknown generating distribution $\mu$.
In a sense, this solves the problem that the generating distribution $\mu$ is not known, in a
universal way. All results are obtained for general finite alphabet. Convergence of $\xi$ to
$\mu$ in a conditional mean squared sense and with $\mu$ probability 1 is proven. The number
of errors $E_{\Theta_\xi}$ made by the universal prediction scheme $\Theta_\xi$ based on $\xi$ minus the number
of errors $E_{\Theta_\mu}$ of the optimal informed prediction scheme $\Theta_\mu$ based on $\mu$ is proven to be
bounded by

$$O\big(\sqrt{E_{\Theta_\mu}}\big).$$

Extensions to arbitrary loss functions, games of chance, infinite alphabet, partial and
delayed prediction, classification, and more active systems are discussed (Section 5). The
main new results are a generalization of the universal probability $\xi$ [Sol64] to arbitrary
probability classes and weights (Section 2), a generalization of the convergence [Sol78]
$\xi \to \mu$ (Section 3) and the error bounds [Hut99] to arbitrary alphabet (Section 4). The
non-binary setting causes substantial additional complications. Non-binary prediction
cannot be (easily) reduced to the binary case. One may have in mind a binary coding
of the symbols $x_k$ in the sequence $x_1x_2...$. But this makes it necessary to predict a block
of bits $x_k$ before receiving the true block of bits $x_k$, which differs from the bit-by-bit
prediction considered in [Sol78, LV97, Hut99].

For an excellent introduction to Kolmogorov complexity and Solomonoff induction one
should consult the book of Li and Vitanyi [LV97] or the article [LV92] for a short course.
Historical surveys of inductive reasoning and inference can be found in [AS83, Sol97].



2 Setup 

2.1 Strings and Probability Distributions 

We denote strings over a finite alphabet $A$ by $x_1x_2...x_n$ with $x_k \in A$. We further use the
abbreviations $x_{n:m} := x_nx_{n+1}...x_{m-1}x_m$ and $x_{<n} := x_1...x_{n-1}$. We use Greek letters for
probability distributions and underline their arguments to indicate that they are
probability arguments. Let $\rho(\underline{x_1...x_n})$ be the probability that an (infinite) sequence starts with
$x_1...x_n$.

We also need conditional probabilities derived from Bayes' rule. We prefer a notation
which preserves the order of the words, in contrast to the standard notation $\rho(\cdot|\cdot)$ which
flips it. We extend the definition of $\rho$ to the conditional case with the following convention
for its arguments: An underlined argument $\underline{x}_k$ is a probability variable and other
non-underlined arguments $x_k$ represent conditions. With this convention, Bayes' rule has the
following look:

$$\rho(x_{<n}\underline{x}_n) = \rho(\underline{x}_{1:n})/\rho(\underline{x}_{<n}), \qquad \rho(\underline{x_1...x_n}) = \rho(\underline{x}_1)\cdot\rho(x_1\underline{x}_2)\cdot...\cdot\rho(x_1...x_{n-1}\underline{x}_n). \tag{2}$$

The first equation states that the probability that a string $x_1...x_{n-1}$ is followed by $x_n$ is
equal to the probability that a string starts with $x_1...x_n$ divided by the probability that
a string starts with $x_1...x_{n-1}$. The second equation is the first, applied $n$ times.



2.2 Universal Prior Probability Distribution 

Most inductive inference problems can be brought into the following form: Given a string
$x_{<k}$, take a guess at its continuation $x_k$. We will assume that the strings which have to be
continued are drawn from a probability^2 distribution $\mu$. The maximal prior information
a prediction algorithm can possess is the exact knowledge of $\mu$, but in many cases the
generating distribution is not known. Instead, the prediction is based on a guess $\rho$ of $\mu$.
We expect that a predictor based on $\rho$ performs well, if $\rho$ is close to $\mu$ or converges, in
a sense, to $\mu$. Let $M := \{\mu_1, \mu_2, ...\}$ be a finite or countable set of candidate probability
distributions on strings. We define a weighted average on $M$

$$\xi(\underline{x}_{1:n}) := \sum_{\mu_i \in M} w_{\mu_i}\cdot\mu_i(\underline{x}_{1:n}), \qquad \sum_{\mu_i \in M} w_{\mu_i} = 1, \quad w_{\mu_i} > 0. \tag{3}$$

It is easy to see that $\xi$ is a probability distribution as the weights $w_{\mu_i}$ are positive and
normalized to 1 and the $\mu_i \in M$ are probabilities. For finite $M$ a possible choice for the
$w$ is to give all $\mu_i$ equal weight ($w_{\mu_i} = \frac{1}{|M|}$). We call $\xi$ universal relative to $M$, as it
multiplicatively dominates all distributions in $M$

$$\xi(\underline{x}_{1:n}) \geq w_{\mu_i}\cdot\mu_i(\underline{x}_{1:n}) \quad\text{for all}\quad \mu_i \in M. \tag{4}$$

In the following, we assume that $M$ is known and contains^3 the true generating
distribution, i.e. $\mu \in M$. We will see that this is not a serious constraint as we can always
choose $M$ to be sufficiently large. In the next section we show the important property of
$\xi$ converging to the generating distribution $\mu$ in a sense and, hence, being a useful
substitute for the true generating, but in general unknown, distribution $\mu$.
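The mixture (3), its conditional form via Bayes' rule (2), and the dominance property (4) can be checked on a toy class. The following is a minimal sketch, assuming a hypothetical class $M$ of three i.i.d. Bernoulli sources over the binary alphabet with equal weights; the parameter values, the sample sequence and all function names are illustrative and not taken from the paper.

```python
def bernoulli_prob(theta, seq):
    """Probability that an i.i.d. Bernoulli(theta) source emits the 0/1 tuple seq."""
    ones = sum(seq)
    return theta ** ones * (1.0 - theta) ** (len(seq) - ones)

def xi_prob(thetas, weights, seq):
    """Mixture (3): xi(x_{1:n}) = sum_i w_i * mu_i(x_{1:n})."""
    return sum(w * bernoulli_prob(t, seq) for t, w in zip(thetas, weights))

def xi_conditional(thetas, weights, history, symbol):
    """Bayes' rule (2): xi(x_<n x_n) = xi(x_{1:n}) / xi(x_<n)."""
    return (xi_prob(thetas, weights, history + (symbol,))
            / xi_prob(thetas, weights, history))

thetas = (0.2, 0.5, 0.8)       # hypothetical class M
weights = (1 / 3, 1 / 3, 1 / 3)  # equal weights w = 1/|M|
seq = (1, 1, 0, 1, 1, 1, 0, 1)   # illustrative observed string

xi = xi_prob(thetas, weights, seq)
# Dominance (4): xi(x_{1:n}) >= w_i * mu_i(x_{1:n}) for every member of M.
dominates = all(xi >= w * bernoulli_prob(t, seq) for t, w in zip(thetas, weights))
p_next_one = xi_conditional(thetas, weights, seq, 1)
p_next_zero = xi_conditional(thetas, weights, seq, 0)
```

In this sketch the conditional probabilities of the two possible next symbols sum to one, and after a string dominated by ones the mixture puts more than half of its predictive mass on the symbol 1.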



2.3 Probability Classes 

We get a rather wide class $M$ if we include all computable probability distributions in $M$.
In this case, the assumption $\mu \in M$ is very weak, as it only assumes that the strings are
drawn from any computable distribution; and all valid physical theories (and, hence, all
environments) are computable (in a probabilistic sense).

^2 This includes deterministic environments, in which case the probability distribution $\mu$ is 1 for some
sequence $x_{1:\infty}$ and 0 for all others. We call probability distributions of this kind deterministic.

^3 Actually all theorems remain valid for $\mu$ being a finite linear combination of $\mu_i \in L \subseteq M$ and
$w_\mu := \min_{\mu_i \in L} w_{\mu_i}$ [Hut01].

We will see that it is favorable to assign high weights $w_{\mu_i}$ to the $\mu_i$. Simplicity should
be favored over complexity, according to Occam's razor. In our context this means that
a high weight should be assigned to simple $\mu_i$. The prefix Kolmogorov complexity $K(\mu_i)$
is a universal complexity measure [Kol65, ZL70, LV97]. It is defined as the length of the
shortest self-delimiting program (on a universal Turing machine) computing $\mu_i(\underline{x}_{1:n})$ given
$x_{1:n}$. If we define

$$w_{\mu_i} := \frac{1}{\Omega}\,2^{-K(\mu_i)}, \qquad \Omega := \sum_{\mu_i \in M} 2^{-K(\mu_i)}$$

then distributions which can be calculated by short programs have high weights. Besides
ensuring correct normalization, $\Omega$ (sometimes called the number of wisdom) has
interesting properties in itself [Cal98, Cha91]. If we enlarge $M$ to include all enumerable
semi-measures, we attain Solomonoff's universal probability, apart from normalization, which
has to be treated differently in this case [Sol64, Sol78]. Recently, $M$ has been further
enlarged to include all cumulatively enumerable semi-measures [Sch00]. In all cases, $\xi$ is
not finitely computable, but can still be approximated to arbitrary but not pre-specifiable
precision. If we consider all approximable (i.e. asymptotically computable) distributions,
then the universal distribution $\xi$, although still well defined, is not even approximable
[Sch00]. An interesting and quickly approximable distribution is the Speed prior $S$
defined in [Sch00]. It is related to Levin complexity and Levin search [Lev73, Lev84], but it
is unclear for now which distributions are dominated by $S$. If one considers only
finite-state automata instead of general Turing machines, one can attain a quickly computable,
universal finite-state prediction scheme related to that of Feder et al. [FMG92], which
itself is related to the famous Lempel-Ziv data compression algorithm. If one has extra
knowledge on the source generating the sequence, one might further reduce $M$ and
increase $w$. A detailed analysis of these and other specific classes $M$ will be given elsewhere.
Note that $\xi \in M$ in the enumerable and cumulatively enumerable case, but $\xi \notin M$ in the
computable, approximable and finite-state case. If $\xi$ is itself in $M$, it is called a universal
element of $M$ [LV97]. As we do not need this property here, $M$ may be any finite or
countable set of distributions. In the following we consider generic $M$ and $w$.
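To make the Occam-style weighting concrete, here is a minimal sketch assuming weights of the form $w_{\mu_i} = 2^{-K(\mu_i)}/\Omega$. Since $K$ is not computable, the code lengths below are hypothetical stand-ins (for instance, lengths of some fixed description of each model), and the model names are invented for illustration only.

```python
def occam_weights(code_lengths):
    """Return (weights, omega) with w_i = 2^(-K_i) / omega and omega = sum_j 2^(-K_j).

    code_lengths maps a model name to an assumed description length in bits;
    these lengths are stand-ins for the uncomputable prefix complexity K.
    """
    raw = {name: 2.0 ** (-bits) for name, bits in code_lengths.items()}
    omega = sum(raw.values())
    return {name: value / omega for name, value in raw.items()}, omega

# Hypothetical description lengths in bits (not from the paper).
lengths = {"fair_coin": 3, "biased_coin": 5, "markov_chain": 9}
weights, omega = occam_weights(lengths)
```

Models with shorter assumed descriptions receive exponentially larger weights, and dividing by `omega` ensures the weights sum to one as required by (3).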



3 Convergence 

3.1 Upper Bound for the Relative Entropy 

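Let us define the relative entropy (also called Kullback Leibler divergence [Kul59]) between
$\mu$ and $\xi$:

$$h_k(x_{<k}) := \sum_{x_k \in A} \mu(x_{<k}\underline{x}_k) \ln\frac{\mu(x_{<k}\underline{x}_k)}{\xi(x_{<k}\underline{x}_k)}. \tag{5}$$

$H_n$ is then defined as the sum-expectation, for which the following upper bound can be
shown

$$\begin{aligned}
H_n &:= \sum_{k=1}^n \sum_{x_{<k} \in A^{k-1}} \mu(\underline{x}_{<k})\cdot h_k(x_{<k}) = \sum_{k=1}^n \sum_{x_{1:k} \in A^k} \mu(\underline{x}_{1:k}) \ln\frac{\mu(x_{<k}\underline{x}_k)}{\xi(x_{<k}\underline{x}_k)} \\
&= \sum_{x_{1:n}} \mu(\underline{x}_{1:n}) \ln\prod_{k=1}^n \frac{\mu(x_{<k}\underline{x}_k)}{\xi(x_{<k}\underline{x}_k)} = \sum_{x_{1:n}} \mu(\underline{x}_{1:n}) \ln\frac{\mu(\underline{x}_{1:n})}{\xi(\underline{x}_{1:n})} \leq \ln\frac{1}{w_\mu} =: d_\mu.
\end{aligned} \tag{6}$$

In the first line we have inserted (5) and used Bayes' rule $\mu(\underline{x}_{<k})\cdot\mu(x_{<k}\underline{x}_k) = \mu(\underline{x}_{1:k})$.
Since summing $\mu(\underline{x}_{1:n})$ over $x_{k+1:n}$ gives $\mu(\underline{x}_{1:k})$, we can further replace
$\sum_{x_{1:k}} \mu(\underline{x}_{1:k})$ by $\sum_{x_{1:n}} \mu(\underline{x}_{1:n})$, as the argument of the
logarithm is independent of $x_{k+1:n}$. The $k$ sum can now be exchanged with the $x_{1:n}$
sum and transforms to a product inside the logarithm. In the last equality we have
used the second form of Bayes' rule (2) for $\mu$ and $\xi$. Using universality (4) of $\xi$, i.e.
$\ln\mu(\underline{x}_{1:n})/\xi(\underline{x}_{1:n}) \leq \ln\frac{1}{w_\mu}$ for $\mu \in M$ yields the final inequality in (6). The proof given here
is a simplified version of those given in [Sol78] and [LV97].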
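The bound $H_n \leq \ln\frac{1}{w_\mu}$ can be observed numerically by brute-force enumeration of all strings of length $n$. This is a minimal sketch, assuming a hypothetical class of three Bernoulli sources with illustrative weights; it uses the last form of $H_n$ in (6), the relative entropy between the joint distributions.

```python
import itertools
import math

THETAS = (0.2, 0.5, 0.8)     # hypothetical class M of Bernoulli parameters
WEIGHTS = (0.25, 0.25, 0.5)  # hypothetical prior weights, summing to 1
TRUE = 0                     # index of the generating distribution mu

def prob(theta, seq):
    """Joint probability of the 0/1 tuple seq under an i.i.d. Bernoulli(theta) source."""
    ones = sum(seq)
    return theta ** ones * (1 - theta) ** (len(seq) - ones)

def xi(seq):
    """Mixture probability of seq, as in (3)."""
    return sum(w * prob(t, seq) for t, w in zip(THETAS, WEIGHTS))

def relative_entropy(n):
    """H_n = sum over all x_{1:n} of mu(x_{1:n}) * ln(mu(x_{1:n}) / xi(x_{1:n}))."""
    total = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        m = prob(THETAS[TRUE], seq)
        total += m * math.log(m / xi(seq))
    return total

d_mu = math.log(1 / WEIGHTS[TRUE])                       # upper bound ln(1/w_mu)
entropies = [relative_entropy(n) for n in range(1, 9)]   # H_1, ..., H_8
```

Under these assumptions the values in `entropies` grow with $n$ (each added term $h_k$ is non-negative) while staying below `d_mu`.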



3.2 Lower Bound for the Relative Entropy 

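We need the following inequality to lower bound $H_n$

$$\sum_{i=1}^N (y_i - z_i)^2 \leq \sum_{i=1}^N y_i \ln\frac{y_i}{z_i} \quad\text{for}\quad y_i \geq 0, \quad z_i \geq 0, \quad \sum_{i=1}^N y_i = 1 = \sum_{i=1}^N z_i. \tag{7}$$

The proof of the case $N = 2$

$$2(y - z)^2 \leq y\ln\frac{y}{z} + (1-y)\ln\frac{1-y}{1-z}, \qquad 0 < y < 1, \quad 0 < z < 1 \tag{8}$$

will not be repeated here, as it is elementary and well known [LV97]. The proof of (7)
is one point where the generalization from binary to arbitrary alphabet is not trivial^4.
We will reduce the general case $N > 2$ to the case $N = 2$. We do this by a partition
$\{1,...,N\} = G^+ \cup G^-$, $G^+ \cap G^- = \{\}$, and define $y^\pm := \sum_{i \in G^\pm} y_i$ and $z^\pm := \sum_{i \in G^\pm} z_i$. It is well
known that the relative entropy is positive, i.e.

$$\sum_{i \in G^\pm} p_i \ln\frac{p_i}{q_i} \geq 0 \quad\text{for}\quad p_i \geq 0, \quad q_i \geq 0, \quad \sum_{i \in G^\pm} p_i = 1 = \sum_{i \in G^\pm} q_i. \tag{9}$$

Note that there are 4 probability distributions ($p_i$ and $q_i$ for $i \in G^+$ and $i \in G^-$). For
$i \in G^\pm$, $p_i := y_i/y^\pm$ and $q_i := z_i/z^\pm$ satisfy the conditions on $p$ and $q$. Inserting this into
(9) and rearranging the terms we get $\sum_{i \in G^\pm} y_i \ln\frac{y_i}{z_i} \geq y^\pm \ln\frac{y^\pm}{z^\pm}$. If we sum this over $\pm$ and
define $y \equiv y^+ = 1 - y^-$ and $z \equiv z^+ = 1 - z^-$ we get

$$\sum_{i=1}^N y_i \ln\frac{y_i}{z_i} \geq \sum_\pm y^\pm \ln\frac{y^\pm}{z^\pm} = y\ln\frac{y}{z} + (1-y)\ln\frac{1-y}{1-z}. \tag{10}$$

For the special choice $G^\pm := \{i : y_i \gtrless z_i\}$, we can upper bound the quadratic term by

$$\sum_{i \in G^\pm} (y_i - z_i)^2 \leq \Big(\sum_{i \in G^\pm} |y_i - z_i|\Big)^2 = \Big(\sum_{i \in G^\pm} (y_i - z_i)\Big)^2 = (y^\pm - z^\pm)^2.$$

The first equality is true, since all $y_i - z_i$ are positive/negative for $i \in G^\pm$ due to the special
choice of $G^\pm$. Summation over $\pm$ gives

$$\sum_{i=1}^N (y_i - z_i)^2 \leq \sum_\pm (y^\pm - z^\pm)^2 = 2(y - z)^2. \tag{11}$$

Chaining the inequalities (11), (8) and (10) proves (7). If we identify

$$A = \{1,...,N\}, \quad N = |A|, \quad i = x_k, \quad y_i = \mu(x_{<k}\underline{x}_k), \quad z_i = \xi(x_{<k}\underline{x}_k) \tag{12}$$

multiply both sides of (7) with $\mu(\underline{x}_{<k})$ and take the sum over $x_{<k}$ and $k$ we get

$$\sum_{k=1}^n \sum_{x_{1:k}} \mu(\underline{x}_{<k})\big(\mu(x_{<k}\underline{x}_k) - \xi(x_{<k}\underline{x}_k)\big)^2 \leq \sum_{k=1}^n \sum_{x_{1:k}} \mu(\underline{x}_{1:k}) \ln\frac{\mu(x_{<k}\underline{x}_k)}{\xi(x_{<k}\underline{x}_k)}. \tag{13}$$

^4 We will not explicate every subtlety and only sketch the proofs. Subtleties regarding $y,z = 0/1$ have
been checked but will be passed over. $0\ln\frac{0}{z_i} := 0$ even for $z_i = 0$. Positive means $\geq 0$.
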
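Inequality (7), the squared Euclidean distance between two probability vectors being bounded by their relative entropy, is easy to probe numerically. The following is a minimal sketch using randomly drawn distributions on alphabets of size 2 to 6; the sampling scheme, the fixed seed and the function names are illustrative assumptions, and a random search is of course no substitute for the proof.

```python
import math
import random

def squared_distance(y, z):
    """Left-hand side of (7): sum_i (y_i - z_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, z))

def kl_divergence(y, z):
    """Right-hand side of (7): sum_i y_i ln(y_i / z_i), with 0 ln 0 := 0."""
    return sum(a * math.log(a / b) for a, b in zip(y, z) if a > 0)

def random_distribution(rng, size):
    """A strictly positive probability vector of the given size."""
    raw = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)   # fixed seed so the run is reproducible
violations = 0
for _ in range(10000):
    size = rng.randint(2, 6)
    y = random_distribution(rng, size)
    z = random_distribution(rng, size)
    if squared_distance(y, z) > kl_divergence(y, z) + 1e-12:
        violations += 1
```

The small tolerance only guards against floating point rounding when `y` and `z` nearly coincide.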

3.3 Convergence of $\xi$ to $\mu$

The upper (6) and lower (13) bounds on $H_n$ allow us to prove the convergence of $\xi$ to
$\mu$ in a conditional mean squared sense and with $\mu$ probability 1.

Theorem 1 (Convergence) Let there be sequences $x_1x_2...$ over a finite alphabet $A$
drawn with probability $\mu(\underline{x}_{1:n})$ for the first $n$ symbols. The universal conditional
probability $\xi(x_{<k}\underline{x}_k)$ of the next symbol $x_k$ given $x_{<k}$ is related to the generating conditional
probability $\mu(x_{<k}\underline{x}_k)$ in the following way:

i) $\sum_{k=1}^n \sum_{x_{1:k}} \mu(\underline{x}_{<k})\big(\mu(x_{<k}\underline{x}_k) - \xi(x_{<k}\underline{x}_k)\big)^2 \leq H_n \leq d_\mu = \ln\frac{1}{w_\mu} < \infty$

ii) $\xi(x_{<k}\underline{x}_k) \to \mu(x_{<k}\underline{x}_k)$ for $k \to \infty$ with $\mu$ probability 1

where $H_n$ is the relative entropy (6), and $w_\mu$ is the weight of $\mu$ in $\xi$.

(i) follows from (6) and (13). For $n \to \infty$ the l.h.s. of (i) is an infinite $k$-sum over
positive arguments, which is bounded by the finite constant $d_\mu$ on the r.h.s. Hence the
arguments must converge to zero for $k \to \infty$. Since the arguments are $\mu$ expectations of
the squared difference of $\xi$ and $\mu$, this means that $\xi(x_{<k}\underline{x}_k)$ converges to $\mu(x_{<k}\underline{x}_k)$ with $\mu$
probability 1 or, more stringently, in a mean square sense. This proves (ii). The reason for
the astonishing property of a single (universal) function $\xi$ to converge to any $\mu_i \in M$ lies
in the fact that the sets of $\mu$-random sequences differ for different $\mu$. Since the conditional
probabilities are the basis of all prediction algorithms considered in this work, we expect
a good prediction performance if we use $\xi$ as a guess of $\mu$. Performance measures are
defined in the following sections.
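Part (i) of Theorem 1 can be illustrated by computing the $\mu$-expected squared prediction error step by step. This is a minimal sketch, assuming a hypothetical class of three i.i.d. Bernoulli sources with equal weights. For i.i.d. sources the conditional $\xi$ probability depends on the history only through the counts of ones and zeros, so histories are grouped by their number of ones instead of being enumerated one by one.

```python
import math

THETAS = (0.3, 0.5, 0.7)          # hypothetical class M of Bernoulli parameters
WEIGHTS = (1 / 3, 1 / 3, 1 / 3)   # equal prior weights
TRUE = 0                          # index of the generating distribution mu

def xi_next_one(ones, zeros):
    """xi-probability that the next symbol is 1 after a history with these counts."""
    post = [w * t ** ones * (1 - t) ** zeros for t, w in zip(THETAS, WEIGHTS)]
    return sum(p * t for p, t in zip(post, THETAS)) / sum(post)

def expected_sq_error(k):
    """mu-expectation over x_<k of sum_{x_k} (mu(x_<k x_k) - xi(x_<k x_k))^2."""
    theta = THETAS[TRUE]
    total = 0.0
    for ones in range(k):
        zeros = k - 1 - ones
        # Probability that the first k-1 symbols contain exactly `ones` ones.
        p_hist = math.comb(k - 1, ones) * theta ** ones * (1 - theta) ** zeros
        diff = theta - xi_next_one(ones, zeros)
        # Binary alphabet: both symbols contribute the same squared difference.
        total += p_hist * 2 * diff ** 2
    return total

errors = [expected_sq_error(k) for k in range(1, 201)]
d_mu = math.log(1 / WEIGHTS[TRUE])   # bound ln(1/w_mu) from Theorem 1 (i)
```

In this setting the partial sums of `errors` stay below `d_mu`, and the individual terms shrink as $k$ grows, in line with the convergence statement.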

4 Error Bounds 

We now consider the following measure for the quality of a prediction: making a wrong
prediction counts as one error, making a correct prediction counts as no error.

4.1 Total Expected Numbers of Errors

Let $\Theta_\mu$ be the optimal prediction scheme when the strings are drawn from the probability
distribution $\mu$, i.e. the probability of $x_k$ given $x_{<k}$ is $\mu(x_{<k}\underline{x}_k)$, and $\mu$ is known. $\Theta_\mu$
predicts (by definition) $x_k^{\Theta_\mu}$ when observing $x_{<k}$. The prediction is erroneous if the true
$k$-th symbol is not $x_k^{\Theta_\mu}$. The probability of this event is $1 - \mu(x_{<k}\underline{x}_k^{\Theta_\mu})$. It is minimized
if $x_k^{\Theta_\mu}$ maximizes $\mu(x_{<k}\underline{x}_k^{\Theta_\mu})$. More generally, let $\Theta_\rho$ be a prediction scheme predicting
$x_k^{\Theta_\rho} := \mathrm{maxarg}_{x_k}\, \rho(x_{<k}\underline{x}_k)$ for some distribution $\rho$. Every deterministic predictor can
be interpreted as maximizing some distribution. The $\mu$ probability of making a wrong
prediction for the $k$-th symbol and the total $\mu$-expected number of errors in the first $n$
predictions of predictor $\Theta_\rho$ are

$$e_{k\Theta_\rho}(x_{<k}) := 1 - \mu(x_{<k}\underline{x}_k^{\Theta_\rho}), \qquad E_{n\Theta_\rho} := \sum_{k=1}^n \sum_{x_1...x_{k-1}} \mu(\underline{x}_{<k})\cdot e_{k\Theta_\rho}(x_{<k}). \tag{14}$$

If $\mu$ is known, $\Theta_\mu$ is obviously the best prediction scheme in the sense of making the least
number of expected errors

$$E_{n\Theta_\mu} \leq E_{n\Theta_\rho} \quad\text{for any}\quad \Theta_\rho, \tag{15}$$

since $e_{k\Theta_\mu}(x_{<k}) = 1 - \mu(x_{<k}\underline{x}_k^{\Theta_\mu}) = \min_{x_k}(1 - \mu(x_{<k}\underline{x}_k)) \leq 1 - \mu(x_{<k}\underline{x}_k^{\Theta_\rho}) = e_{k\Theta_\rho}(x_{<k})$ for any $\rho$.

4.2 Error Bound 

Of special interest is the universal predictor $\Theta_\xi$. As $\xi$ converges to $\mu$ the prediction of
$\Theta_\xi$ might converge to the prediction of the optimal $\Theta_\mu$. Hence, $\Theta_\xi$ may not make many
more errors than $\Theta_\mu$ and, hence, any other predictor $\Theta_\rho$. Note that $x_k^{\Theta_\rho}$ is a discontinuous
function of $\rho$ and $x_k^{\Theta_\xi} \to x_k^{\Theta_\mu}$ cannot be proved from $\xi \to \mu$. Indeed, this problem
occurs in related prediction schemes, where the predictor has to be regularized so that it
is continuous [FMG92]. Fortunately this is not necessary here. We prove the following
error bound.

Theorem 2 (Error bound) Let there be sequences $x_1x_2...$ over a finite alphabet $A$
drawn with probability $\mu(\underline{x}_{1:n})$ for the first $n$ symbols. The $\Theta_\rho$-system predicts by
definition $x_n^{\Theta_\rho} \in A$ from $x_{<n}$, where $x_n^{\Theta_\rho}$ maximizes $\rho(x_{<n}\underline{x}_n)$. $\Theta_\xi$ is the universal prediction
scheme based on the universal prior $\xi$. $\Theta_\mu$ is the optimal informed prediction scheme.
The total $\mu$-expected number of prediction errors $E_{n\Theta_\xi}$ and $E_{n\Theta_\mu}$ of $\Theta_\xi$ and $\Theta_\mu$ as defined
in (14) are bounded in the following way

$$0 \leq E_{n\Theta_\xi} - E_{n\Theta_\mu} \leq H_n + \sqrt{4E_{n\Theta_\mu}H_n + H_n^2} \leq 2H_n + 2\sqrt{E_{n\Theta_\mu}H_n}$$

where $H_n \leq \ln\frac{1}{w_\mu}$ is the relative entropy (6), and $w_\mu$ is the weight (3) of $\mu$ in $\xi$.

First, we observe that the number of errors $E_{\infty\Theta_\xi}$ of the universal $\Theta_\xi$ predictor is finite
if the number of errors $E_{\infty\Theta_\mu}$ of the informed $\Theta_\mu$ predictor is finite. This is especially
the case for deterministic $\mu$, as $E_{n\Theta_\mu} \equiv 0$ in this case^5, i.e. $\Theta_\xi$ makes only a finite number
of errors on deterministic environments. More precisely, $E_{\infty\Theta_\xi} \leq 2H_\infty \leq 2\ln\frac{1}{w_\mu}$. A
combinatoric argument shows that there are $M$ and $\mu \in M$ with $E_{\infty\Theta_\xi} \geq \log_2|M|$. This
shows that the upper bound $E_{\infty\Theta_\xi} \leq 2\ln|M|$ for uniform $w$ must be rather tight. For more
complicated probabilistic environments, where even the ideal informed system makes an
infinite number of errors, the theorem ensures that the error excess $E_{n\Theta_\xi} - E_{n\Theta_\mu}$ is only of
order $\sqrt{E_{n\Theta_\mu}}$. The excess is quantified in terms of the information content $H_n$ of $\mu$ (relative
to $\xi$), or the weight $w_\mu$ of $\mu$ in $\xi$. This ensures that the error densities $E_n/n$ of both systems
converge to each other. Actually, the theorem ensures more, namely that the quotient
converges to 1, and also gives the speed of convergence $E_{n\Theta_\xi}/E_{n\Theta_\mu} = 1 + O(E_{n\Theta_\mu}^{-1/2}) \to 1$
for $E_{n\Theta_\mu} \to \infty$.
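The quantities in the error bound can be computed exactly for a small example. The sketch below assumes a hypothetical class of three i.i.d. Bernoulli sources with illustrative weights, and evaluates the expected errors (14) of the informed and the universal predictor together with $H_n$ from (6). Histories are grouped by their number of ones, which is valid here because the $\xi$ prediction depends only on the counts. The excess is compared with $H_n + \sqrt{4E_{n\Theta_\mu}H_n + H_n^2}$, the form of the bound derived in the proof in Section 4.3.

```python
import math

THETAS = (0.3, 0.5, 0.7)    # hypothetical class M of Bernoulli parameters
WEIGHTS = (0.2, 0.3, 0.5)   # hypothetical prior weights, summing to 1
TRUE = 0                    # index of the generating distribution mu

def xi_next_one(ones, zeros):
    """xi-probability that the next symbol is 1 after a history with these counts."""
    post = [w * t ** ones * (1 - t) ** zeros for t, w in zip(THETAS, WEIGHTS)]
    return sum(p * t for p, t in zip(post, THETAS)) / sum(post)

def error_statistics(n):
    """Return (E_mu, E_xi, H_n): expected errors of both predictors and relative entropy."""
    theta = THETAS[TRUE]
    e_mu = e_xi = h_n = 0.0
    for k in range(1, n + 1):
        for ones in range(k):
            zeros = k - 1 - ones
            p_hist = math.comb(k - 1, ones) * theta ** ones * (1 - theta) ** zeros
            q = xi_next_one(ones, zeros)
            # The informed predictor picks the more probable symbol under mu.
            e_mu += p_hist * min(theta, 1 - theta)
            # The universal predictor picks 1 iff xi favours it; it errs with
            # the mu-probability of the other symbol.
            e_xi += p_hist * ((1 - theta) if q > 0.5 else theta)
            h_n += p_hist * (theta * math.log(theta / q)
                             + (1 - theta) * math.log((1 - theta) / (1 - q)))
    return e_mu, e_xi, h_n

e_mu, e_xi, h_n = error_statistics(60)
excess_bound = h_n + math.sqrt(4 * e_mu * h_n + h_n ** 2)
```

With these assumed parameters the informed predictor makes $0.3 \cdot 60 = 18$ expected errors, and the excess `e_xi - e_mu` of the universal predictor lies between zero and `excess_bound`.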



4.3 Proof of Theorem 2

The first inequality in Theorem 2 has already been proved (15). The last inequality is a
simple triangle inequality. For the second inequality, let us start more modestly and try
to find constants $A$ and $B$ that satisfy the linear inequality

$$E_{n\Theta_\xi} \leq (A+1)E_{n\Theta_\mu} + (B+1)H_n. \tag{16}$$

If we could show

$$e_{k\Theta_\xi}(x_{<k}) \leq (A+1)e_{k\Theta_\mu}(x_{<k}) + (B+1)h_k(x_{<k}) \tag{17}$$

for all $k \leq n$ and all $x_{<k}$, (16) would follow immediately by summation and the definition
of $E_n$ and $H_n$. With the abbreviations (12) and the abbreviations $m = x_k^{\Theta_\mu}$ and $s = x_k^{\Theta_\xi}$
the various error functions can then be expressed by $e_{k\Theta_\xi} = 1 - y_s$, $e_{k\Theta_\mu} = 1 - y_m$ and
$h_k = \sum_i y_i \ln\frac{y_i}{z_i}$. Inserting this into (17) we get

$$1 - y_s \leq (A+1)(1 - y_m) + (B+1)\sum_{i=1}^N y_i \ln\frac{y_i}{z_i}. \tag{18}$$

By definition of $x_k^{\Theta_\mu}$ and $x_k^{\Theta_\xi}$ we have $y_m \geq y_i$ and $z_s \geq z_i$ for all $i$. We prove a sequence of
inequalities which show that

$$(B+1)\sum_{i=1}^N y_i \ln\frac{y_i}{z_i} + (A+1)(1 - y_m) - (1 - y_s) \geq ... \tag{19}$$

is positive for suitable $A \geq 0$ and $B \geq 0$, which proves (18). For $m = s$ (19) is obviously
positive since the relative entropy is positive ($h_k \geq 0$). So we will assume $m \neq s$ in the
following. We replace the relative entropy by the sum over squares (7) and further keep
only contributions from $i = m$ and $i = s$.

$$... \geq (B+1)\big[(y_m - z_m)^2 + (y_s - z_s)^2\big] + (A+1)(1 - y_m) - (1 - y_s) \geq ...$$

By definition of $m$ and $s$ we have the constraints $y_m + y_s \leq 1$, $z_m + z_s \leq 1$, $y_m \geq y_s \geq 0$
and $z_s \geq z_m \geq 0$. From the latter two it is easy to see that the square terms (as a function
of $z_m$ and $z_s$) are minimized by $z_m = z_s = \frac{1}{2}(y_m + y_s)$. Furthermore, we define $x := y_m - y_s$
and eliminate $y_s$.

$$... \geq (B+1)\tfrac{1}{2}x^2 + A(1 - y_m) - x \geq ... \tag{20}$$

The constraint $y_m + y_s \leq 1$ translates into $y_m \leq \frac{x+1}{2}$, hence (20) is minimized by $y_m = \frac{x+1}{2}$.

$$... \geq \tfrac{1}{2}\big[(B+1)x^2 - (A+2)x + A\big] \geq ... \tag{21}$$

(21) is quadratic in $x$ and minimized by $x^* = \frac{A+2}{2(B+1)}$. Inserting $x^*$ gives

$$... \geq \frac{4AB - A^2 - 4}{8(B+1)} \geq 0 \quad\text{for}\quad B \geq \tfrac{1}{4}A + \tfrac{1}{A}, \quad A > 0, \quad (\Rightarrow B \geq 1). \tag{22}$$

Inequality (16) therefore holds for any $A > 0$, provided we insert $B = \frac{1}{4}A + \frac{1}{A}$. Thus we
might minimize the r.h.s. of (16) w.r.t. $A$ leading to the upper bound

$$E_{n\Theta_\xi} \leq E_{n\Theta_\mu} + H_n + \sqrt{4E_{n\Theta_\mu}H_n + H_n^2} \quad\text{for}\quad A^2 = \frac{H_n}{E_{n\Theta_\mu} + \frac{1}{4}H_n}$$

which completes the proof of Theorem 2 □.

^5 Remember that we named a probability distribution deterministic if it is 1 for exactly one sequence
and 0 for all others.

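The final minimization step of the proof can be double-checked numerically: with $B = \frac{1}{4}A + \frac{1}{A}$ inserted, the right-hand side of (16) should attain the closed form $E_{n\Theta_\mu} + H_n + \sqrt{4E_{n\Theta_\mu}H_n + H_n^2}$ at $A^2 = H_n/(E_{n\Theta_\mu} + \frac{1}{4}H_n)$. A minimal sketch follows; the numerical values for $E_{n\Theta_\mu}$ and $H_n$ are arbitrary assumptions chosen only for the check.

```python
import math

def linear_bound(a, e_mu, h_n):
    """Right-hand side of (16) with B = A/4 + 1/A inserted."""
    b = a / 4 + 1 / a
    return (a + 1) * e_mu + (b + 1) * h_n

def closed_form_bound(e_mu, h_n):
    """E_mu + H_n + sqrt(4 E_mu H_n + H_n^2), the claimed minimum over A > 0."""
    return e_mu + h_n + math.sqrt(4 * e_mu * h_n + h_n ** 2)

e_mu, h_n = 18.0, 1.2                        # arbitrary illustrative values
a_star = math.sqrt(h_n / (e_mu + h_n / 4))   # minimizing A from the proof
at_optimum = linear_bound(a_star, e_mu, h_n)
closed = closed_form_bound(e_mu, h_n)
# Coarse grid search over A in (0, 5] as an independent check of the minimum.
grid_min = min(linear_bound(a / 1000, e_mu, h_n) for a in range(1, 5001))
# Weaker but simpler form obtained from sqrt(u + v) <= sqrt(u) + sqrt(v).
relaxed = e_mu + 2 * h_n + 2 * math.sqrt(e_mu * h_n)
```

The value at `a_star` agrees with the closed form, no grid point undercuts it, and the relaxed expression is never smaller, which matches the last inequality of Theorem 2.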

5 Generalizations 

In the following we discuss several directions in which the findings of this work may be 
extended. 



5.1 General Loss Function 

A prediction is very often the basis for some decision. The decision results in an action,
which itself leads to some reward or loss. To stay in the framework of (passive)
prediction we have to assume that the action itself does not influence the environment. Let
$l^k_{x_k y_k}(x_{<k}) \in [l_{min}, l_{min} + l_\Delta]$ be the received loss when taking action $y_k \in Y$ and $x_k \in A$
is the $k$-th symbol of the sequence. For instance, if we make a sequence of weather
forecasts $A = \{$sunny, rainy$\}$ and base our decision, whether to take an umbrella or
wear sunglasses $Y = \{$umbrella, sunglasses$\}$ on it, the action of taking the umbrella or
wearing sunglasses does not influence the future weather (ignoring the butterfly effect).
The error assignment of Section 4 falls into this class. The action was just a prediction
($Y = A$) and a unit loss was assigned to an erroneous prediction ($l_{x_k y_k} = 1$ for $x_k \neq y_k$)
and no loss to a correct prediction ($l_{x_k x_k} = 0$). In general, a $\Lambda_\rho$ action/prediction scheme
$y_k^{\Lambda_\rho} := \mathrm{minarg}_{y_k} \sum_{x_k} \rho(x_{<k}\underline{x}_k)\, l_{x_k y_k}$ can be defined that minimizes the $\rho$-expected loss. $\Lambda_\xi$
is the universal scheme based on the universal prior $\xi$. $\Lambda_\mu$ is the optimal informed scheme.
In [Hut01] it is proven that the total $\mu$-expected losses $L_{n\Lambda_\xi}$ and $L_{n\Lambda_\mu}$ of $\Lambda_\xi$ and $\Lambda_\mu$ are
bounded in the following way:

$$0 \leq L_{n\Lambda_\xi} - L_{n\Lambda_\mu} \leq l_\Delta H_n + \sqrt{4(L_{n\Lambda_\mu} - n\,l_{min})\,l_\Delta H_n + l_\Delta^2 H_n^2}.$$

The loss bound has a similar form as the error bound of Theorem 2, but the proof is much
more involved.
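The $\Lambda_\rho$ rule, choosing the action with the smallest $\rho$-expected loss, can be sketched for the weather example. The loss values below are hypothetical numbers invented for illustration, as are the function and variable names; only the alphabet and the action set come from the text.

```python
def best_action(next_symbol_probs, loss):
    """Return the action y minimizing sum_x rho(x) * loss[x][y].

    next_symbol_probs maps each symbol x to its predictive probability rho(x);
    loss[x][y] is the loss of taking action y when the symbol turns out to be x.
    """
    actions = list(next(iter(loss.values())).keys())

    def expected_loss(action):
        return sum(p * loss[x][action] for x, p in next_symbol_probs.items())

    return min(actions, key=expected_loss)

# Hypothetical losses: carrying an umbrella is a small nuisance,
# getting soaked without one is costly.
weather_loss = {
    "sunny": {"umbrella": 0.3, "sunglasses": 0.0},
    "rainy": {"umbrella": 0.1, "sunglasses": 1.0},
}

# Unit error loss from Section 4: action = predicted symbol, loss 1 iff wrong.
error_loss = {
    "sunny": {"sunny": 0.0, "rainy": 1.0},
    "rainy": {"sunny": 1.0, "rainy": 0.0},
}

mostly_sunny = best_action({"sunny": 0.9, "rainy": 0.1}, weather_loss)
uncertain = best_action({"sunny": 0.5, "rainy": 0.5}, weather_loss)
most_probable = best_action({"sunny": 0.4, "rainy": 0.6}, error_loss)
```

With the unit error loss the rule reduces to predicting the most probable symbol, which is the $\Theta_\rho$ scheme of Section 4. With an asymmetric loss the chosen action can differ from the most likely outcome, as in the `uncertain` case where the umbrella has the lower expected loss.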
5.2 Games of Chance



The general loss bound stated in the previous subsection can be used to estimate the
time needed to reach the winning threshold in a game of chance (defined as a sequence of
bets, observations and rewards). In step $k$ we bet, depending on the history $x_{<k}$, a certain
amount of money $s_k$, take some action $y_k$, observe outcome $x_k$, and receive reward $r_k$. Our
profit, which we want to maximize, is $p_k = r_k - s_k \in [p_{max} - p_\Delta, p_{max}]$. The loss, which we
want to minimize, can be identified with the negative profit, $l_{x_k y_k} = -p_k$. The $\Lambda_\rho$-system
acts so as to maximize the $\rho$-expected profit. Let $\bar{p}_{n\Lambda_\rho}$ be the average expected profit of the
first $n$ rounds. One can show that the average profit of the $\Lambda_\xi$ system converges to the
best possible average profit $\bar{p}_{n\Lambda_\mu}$ achieved by the $\Lambda_\mu$ scheme ($\bar{p}_{n\Lambda_\xi} - \bar{p}_{n\Lambda_\mu} = O(n^{-1/2}) \to 0$
for $n \to \infty$). If there is a profitable scheme at all, then asymptotically the universal $\Lambda_\xi$
scheme will also become profitable with the same average profit. In [Hut01] it is further
shown that $(2p_\Delta/\bar{p}_{n\Lambda_\mu})^2 \cdot d_\mu$ is an upper bound for the number of bets $n$ needed to reach the
winning zone. The bound is proportional to the relative entropy of $\mu$ and $\xi$.



5.3 Infinite Alphabet 

In many cases the basic prediction unit is not a letter, but a number (for inducing number
sequences), or a word (for completing sentences), or a real number or vector (for physical
measurements). The prediction may either be generalized to a block by block prediction
of symbols or, more suitably, the finite alphabet $A$ could be generalized to countable
(numbers, words) or continuous (real or vector) alphabet. The theorems should
generalize to countably infinite alphabets by appropriately taking the limit $|A| \to \infty$ and to
continuous alphabets by a denseness or separability argument.



5.4 Partial Prediction, Delayed Prediction, Classification 

The $\Lambda_\rho$ schemes may also be used for partial prediction where, for instance, only every
$m$-th symbol is predicted. This can be arranged by setting the loss $l^k$ to zero when no
prediction is made, e.g. if $k$ is not a multiple of $m$. Classification could be interpreted
as partial sequence prediction, where $x_{(k-1)m+1:km-1}$ is classified as $x_{km}$. There are better
ways for classification by treating $x_{(k-1)m+1:km-1}$ as pure conditions in $\xi$, as has been done
in [Hut00] in a more general context. Another possibility is to generalize the prediction
schemes and theorems to delayed sequence prediction, where the true symbol $x_k$ is given
only in cycle $k + d$. A delayed feedback is common in many practical problems.



5.5 More Active Systems 

Prediction means guessing the future, but not influencing it. A tiny step in the direction
of more active systems, described in subsection 5.1, was to allow the $\Lambda$ system to act and
to receive a loss $l_{x_k y_k}$ depending on the action $y_k$ and the outcome $x_k$. The probability
$\mu$ is still independent of the action, and the loss function $l^k$ has to be known in advance.
This ensures that the greedy strategy is optimal. The loss function may be generalized
to depend not only on the history $x_{<k}$, but also on the historic actions $y_{<k}$ with $\mu$ still
independent of the action. It would be interesting to know whether the scheme $\Lambda$ and/or
the loss bounds generalize to this case. The full model of an acting agent influencing the
environment has been developed in [Hut00], but loss bounds have yet to be proven.



5.6 Miscellaneous 

Another direction is to investigate the learning aspect of universal prediction. Many 
prediction schemes explicitly learn and exploit a model of the environment. Learning 
and exploitation are melted together in the framework of universal Bayesian prediction. 
A separation of these two aspects in the spirit of hypothesis learning with MDL [VL00]
could lead to new insights. Finally, the system should be tested on specific induction
problems for specific $M$ with computable $\xi$.



6 Summary 

Solomonoff's universal probability measure has been generalized to arbitrary probability
classes and weights. A wise choice of $M$ widens the applicability by reducing the
computational burden for $\xi$. Convergence of $\xi$ to $\mu$ and error bounds have been proven for arbitrary
finite alphabet. They show that the universal prediction scheme $\Lambda_\xi$ is an excellent
substitute for the best possible (but generally unknown) informed scheme $\Lambda_\mu$. Extensions
and applications, including general loss functions and bounds, games of chance, infinite
alphabet, partial and delayed prediction, classification, and more active systems, have
been discussed.